The following content is based on the tidymodels documentation and Alison Hill’s tidymodels workshop.
In this tutorial, we’ll explore a tidymodels package, recipes, which is designed to help you preprocess your data before training your model.
Recipes are built as a series of preprocessing steps, such as:
and so on. If you are familiar with R’s formula interface, much of this may sound like what a formula already does. Recipes can do many of the same things, but they offer a much wider range of possibilities. This article shows how to use recipes for modeling.
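As a point of comparison, here is a minimal sketch of preprocessing done by a formula alone in base R. It uses the built-in mtcars data rather than the Ames data, and the choice of variables is purely illustrative.

```r
# Illustrative only: mtcars ships with R and stands in for real data
mm <- model.matrix(mpg ~ log(disp) + factor(cyl), data = mtcars)

# The formula log-transformed disp and expanded cyl into indicator columns
colnames(mm)
head(mm, 3)
```

The formula handles the transformation and the dummy encoding inline. Operations that apply to whole groups of columns, such as normalizing every predictor, are more naturally written as recipe steps.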
In summary, the idea of the recipes package is to define a recipe, or blueprint, that sequentially specifies the encodings and preprocessing of the data (i.e. “feature engineering”) before we build our models.
Import the data and split it into training and testing sets using initial_split():
library(tidyverse)
library(tidymodels)
ames <- read_csv("https://raw.githubusercontent.com/kirenz/datasets/master/ames.csv")
# Drop all columns whose names match "Qu"
ames <- ames %>%
  select(-matches("Qu"))
set.seed(100)
new_split <- initial_split(ames)
new_train <- training(new_split)
new_test <- testing(new_split)
Next, we use a recipe() to build a set of steps for data preprocessing and feature engineering.
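recipe() specifies what our model is going to be (using a formula here) and what our training data is.
step_novel() will convert all nominal variables to factors and assign any previously unseen factor level to a new level.
step_dummy() converts the nominal variables into dummy variables.
step_zv() removes predictors that have zero variance.
step_normalize() centers and scales the predictors.
ames_rec <-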
  recipe(Sale_Price ~ ., data = new_train) %>%
  step_novel(all_nominal(), -all_outcomes()) %>%
  step_dummy(all_nominal()) %>%
  step_zv(all_predictors()) %>%
  step_normalize(all_predictors())
# Show the content of our recipe
ames_rec
## Data Recipe
##
## Inputs:
##
##       role #variables
##    outcome          1
##  predictor         73
##
## Operations:
##
## Novel factor level assignment for all_nominal(), -all_outcomes()
## Dummy variables from all_nominal()
## Zero variance filter on all_predictors()
## Centering and scaling for all_predictors()
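To make the last two operations concrete, the following is a rough base R sketch of what a zero variance filter and centering and scaling amount to. The toy data frame and its column names are hypothetical, and this is a by-hand approximation, not how recipes implements these steps.

```r
# Hypothetical toy predictors, not taken from the Ames data
toy <- data.frame(
  area     = c(50, 70, 90, 110),
  constant = c(1, 1, 1, 1)        # a column with a single unique value
)

# Zero variance filter: keep only columns with more than one distinct value
keep <- vapply(toy, function(col) length(unique(col)) > 1, logical(1))
toy_zv <- toy[, keep, drop = FALSE]

# Centering and scaling: subtract the mean, divide by the standard deviation
toy_norm <- as.data.frame(scale(toy_zv))

names(toy_norm)
colMeans(toy_norm)
sapply(toy_norm, sd)
```

Filtering before normalizing matters here: a constant column has a standard deviation of zero, so scaling it would divide by zero.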
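Finally, we prep() the recipe(). This estimates the quantities each step needs from the training data, such as the factor levels used for the dummy variables and the means and standard deviations used for normalization.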
ames_prep <- prep(ames_rec)
Print a summary of our prepped recipe:
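summary(ames_prep)
To obtain the data frame from the prepped recipe, we use the function juice(). When we juice() the recipe, we squeeze that training data back out, transformed in the ways we specified (in recent versions of recipes, juice() is superseded by bake(ames_prep, new_data = NULL), which returns the same result):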
juice(ames_prep)
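A prepped recipe can also be applied to data it was not trained on, using bake() with the new_data argument. The sketch below illustrates this on a small hypothetical data set (the train and test data frames and their columns are made up for the example), showing that the new rows are scaled with the statistics estimated from the training rows.

```r
library(recipes)

# Hypothetical toy data standing in for a training and a testing set
train <- data.frame(y = c(1, 2, 3, 4), x = c(10, 20, 30, 40))
test  <- data.frame(y = c(5, 6),       x = c(50, 60))

rec <- recipe(y ~ x, data = train) %>%
  step_normalize(all_predictors())

# prep() estimates mean(x) and sd(x) from the training data only
rec_prep <- prep(rec)

# new_data = NULL returns the processed training data, like juice()
train_baked <- bake(rec_prep, new_data = NULL)

# new_data = test applies the training statistics to the test rows
test_baked <- bake(rec_prep, new_data = test)

# Compare with a by-hand calculation using the training mean and sd
expected <- (test$x - mean(train$x)) / sd(train$x)
all.equal(test_baked$x, expected)
```

With the objects from this tutorial, the analogous call would be bake(ames_prep, new_data = new_test).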